Hash Embeddings for Efficient Word Representations
Abstract
We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings, each token is represented by k d-dimensional embedding vectors and one k-dimensional weight vector. The final d-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token, these are selected by the hashing trick from a shared pool of B embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension and improvement over the existing regular embedding types.
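The construction described in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration only, assuming NumPy; the class and parameter names (HashEmbedding, num_buckets, pool_size) are placeholders rather than the paper's notation, Python's built-in hash stands in for proper hash functions, and training of the pool and weight vectors is omitted.

    import numpy as np

    class HashEmbedding:
        def __init__(self, num_buckets, pool_size, k=2, d=20, seed=0):
            rng = np.random.default_rng(seed)
            # Shared pool of embedding vectors (pool_size plays the role of B).
            self.pool = rng.normal(scale=0.1, size=(pool_size, d))
            # One k-dimensional importance-weight vector per hashed token slot.
            self.weights = rng.normal(scale=0.1, size=(num_buckets, k))
            self.k, self.pool_size, self.num_buckets = k, pool_size, num_buckets

        def embed(self, token):
            # Hashing the raw token string means no dictionary is needed up front.
            # (Python's hash is salted per process; a real system would use a fixed hash.)
            slot = hash(token) % self.num_buckets
            # k hash functions each pick one component vector from the shared pool.
            ids = [hash((i, token)) % self.pool_size for i in range(self.k)]
            components = self.pool[ids]                 # shape (k, d)
            # Final d-dimensional representation: weight vector times the components.
            return self.weights[slot] @ components      # shape (d,)

    emb = HashEmbedding(num_buckets=10_000, pool_size=1_000)
    vec = emb.embed("hashing")                          # a d-dimensional vector

In this view, using a single component per token with fixed weights reduces the construction to the plain hashing trick, consistent with the abstract's remark that both standard embeddings and the hashing trick are special cases.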
Similar Resources
SPINE: SParse Interpretable Neural Embeddings
Prediction without justification has limited utility. Much of the success of neural models can be attributed to their ability to learn rich, dense and expressive representations. While these representations capture the underlying complexity and latent trends in the data, they are far from being interpretable. We propose a novel variant of denoising k-sparse autoencoders that generates highly ef...
Siamese CBOW: Optimizing Word Embeddings for Sentence Representations
We present the Siamese Continuous Bag of Words (Siamese CBOW) model, a neural network for efficient estimation of high-quality sentence embeddings. Averaging the embeddings of words in a sentence has proven to be a surprisingly successful and efficient way of obtaining sentence embeddings. However, word embeddings trained with the methods currently available are not optimized for the task of sen...
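As a point of reference, the averaging baseline mentioned above, taking a sentence embedding to be the mean of its word embeddings, can be written directly. The toy vocabulary and vector values below are placeholders, not anything from Siamese CBOW itself.

    import numpy as np

    # Toy word vectors standing in for a trained embedding table.
    word_vectors = {
        "word":       np.array([0.1, 0.3, 0.5]),
        "embeddings": np.array([0.2, 0.1, 0.4]),
        "sentence":   np.array([0.4, 0.2, 0.1]),
    }

    def average_sentence_embedding(sentence, vectors, dim=3):
        # Average the embeddings of the in-vocabulary tokens.
        tokens = [t for t in sentence.lower().split() if t in vectors]
        if not tokens:
            return np.zeros(dim)
        return np.mean([vectors[t] for t in tokens], axis=0)

    print(average_sentence_embedding("word embeddings", word_vectors))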
Sentiment Analysis by Joint Learning of Word Embeddings and Classifier
Word embeddings are representations of individual words of a text document in a vector space, and they are often useful for performing natural language processing tasks. Current state-of-the-art algorithms for learning word embeddings learn vector representations from large corpora of text documents in an unsupervised fashion. This paper introduces SWESA (Supervised Word Embeddings for Sentiment...
On Approximately Searching for Similar Word Embeddings
We discuss an approximate similarity search for word embeddings, which is an operation to approximately find embeddings close to a given vector. We compared several metric-based search algorithms with hash-, tree-, and graph-based indexing from different aspects. Our experimental results showed that graph-based indexing exhibits robust performance and additionally provided useful information, ...
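For context, the exact search that these approximate indexes speed up is a full scan over the embedding matrix; a brute-force cosine-similarity version looks like the sketch below (array names and sizes are illustrative, not from the paper), and hash-, tree-, or graph-based indexes trade some of this exactness for query speed.

    import numpy as np

    def nearest_neighbors(query, embeddings, top_k=5):
        # Normalize rows so that dot products equal cosine similarities.
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        q = query / np.linalg.norm(query)
        sims = emb @ q
        # Indices of the top_k most similar embeddings (exact, full scan per query).
        return np.argsort(-sims)[:top_k]

    embeddings = np.random.default_rng(0).normal(size=(10_000, 100))
    print(nearest_neighbors(embeddings[42], embeddings))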
Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
The recent tremendous success of unsupervised word embeddings in a multitude of applications raises the obvious question of whether similar methods could be derived to improve embeddings (i.e. semantic representations) of word sequences as well. We present a simple but efficient unsupervised objective to train distributed representations of sentences. Our method outperforms the state-of-the-art unsuper...